The importance of emotion recognition lies in the role that emotions play in our everyday
lives. Emotions have a strong relationship with our behavior. Hence, automatic emotion
recognition aims to equip machines with the human ability to analyze and understand
the human emotional state, in order to anticipate a person's intentions from facial expression. In
this paper, a new approach is proposed to enhance the accuracy of emotion recognition from
facial expression, based on input features derived only from fiducial points.
The proposed approach consists firstly of extracting 1176 dynamic features from image
sequences; these features are the ratios between the Euclidean distances separating facial fiducial
points in the last frame and the same distances in the first frame. Secondly, a feature
selection method is used to keep only the most relevant features. Finally, the
selected features are presented to a Neural Network (NN) classifier to classify the facial
expression input into an emotion. The proposed approach has achieved an emotion recognition
accuracy of 99% on the CK+ database, 84.7% on the Oulu-CASIA VIS database, and
93.8% on the JAFFE database.

1. INTRODUCTION

As emotions play an implicit role in the communication process and reflect human behavior, automatic
emotion recognition is a task of growing interest. To recognize human emotions, a wide range of features can be
used, such as facial expression [1, 2], body gesture [2, 3], or speech [4, 5]. Giving a computer the capability of
emotion recognition (ER) is a scientific challenge that brings together different communities (signal processing,
image processing, artificial intelligence, robotics, human-computer interaction).

Mehrabian [6] affirms that facial expression represents 55% of the nonverbal communication that allows
one to understand the state or the emotion of a person. The objective of this work is to get a computer to detect human
emotions from facial expressions.

Facial expression is the most important key to understanding human emotions. Not all facial expressions
have a meaning or can be classified into emotions, but there are some basic emotions that are universal [7]
and are expressed in the same way: happiness, sadness, fear, anger, disgust, and surprise.

The main steps of facial expression recognition are generally face detection, feature extraction,
and facial expression classification. In the first step, we have to determine whether an image belongs to the class
of faces or not. In the second step, we have to extract the features of the face that best describe
emotions. In the last step, we have to classify the extracted features into basic emotions. However, the difficulty usually
comes from the second step, feature extraction: a set of features that best describes a facial expression
movement must be found and used for classification. For this reason, the technique proposed in this paper is based
on image sequences and focuses on calculating 1176 Euclidean distances between all detected points to measure all
possible deformations of the face, because some distances may be more descriptive than those that appear visually
logical. Once we have calculated dynamic features from fiducial points, an additional feature selection step
(see Figure 1) is used to reduce the number of features by keeping only the most relevant ones.


Face detection -> Feature extraction -> Feature selection -> Classification


Figure 1. Steps of an emotion recognition system


The rest of the paper is organized as follows: an overview of related work is presented in Section 2, our
proposed method is presented in Section 3, and the experimental setup is given in Section 4. Section 5 presents the results and a
discussion of our proposed approach, with a comparison of recognition rates against previous works.
Section 6 concludes the paper and presents some directions for future research.


2. RELATED WORK 

In general, facial expression recognition systems can be classified into two categories: geometric feature-based
methods and global feature-based methods [8]. In geometric feature-based methods, only some parts of the
face, such as the eyes, nose and mouth, are considered for feature extraction. Such methods consume a lot of computation
time to obtain accurate results for facial feature detection and tracking, which is a major disadvantage. In contrast, in
global feature-based methods, the whole face is considered for feature extraction, as in [9], where global
features are extracted from face images using local Zernike moments. These methods are easy to use because they
work directly on facial images to describe facial textures.

There is a plethora of works that aim to facilitate the recognition of emotions from facial expression
using static [10, 11, 12] or dynamic images [13, 14, 15, 16, 17, 18, 9].

By measuring dynamic facial motions in image sequences, Bassili [19] has confirmed that dynamic images
give more accurate results in facial expression recognition than single static ones.

Friesen et al. [14] have proposed the FACS system that describes movements of the face, where forty-four
Action Units (AUs) are defined, each one representing a movement of a particular part of the face (e.g., Brow
Lowerer). According to Friesen et al., a facial expression can be characterized by a combination of AUs.

To demonstrate that AUs are capable of conveying emotion expressions, Basori et al. [20] have generated
emotion expressions for an avatar using combinations of AUs based on facial muscles.

Pantic et al. [15] have focused their work on recognizing facial action units (AUs) and their temporal
models using profile-view face image sequences. To track 15 facial points in an input face profile sequence, they
apply a particle filtering method [21].

Valstar et al. [22] have proposed an automatic method to recognize 22 action units (AUs) and their models
using image sequences. Firstly, to automatically detect 20 fiducial points, they used a Gabor-feature-based boosted
classifier; then, these points were tracked through a sequence of images using a particle filtering method with
factorized likelihoods.

Pu et al. [16] have suggested a new framework for facial expression analysis based on recognizing AUs
from image sequences. To detect and track fiducial points, they first applied AAM [23] to model the neutral facial
expression in the first frame, and then used a pyramidal implementation of Lucas-Kanade [24] to track feature
points in the other frames. They classified facial expressions in two levels, using random forest as the classification
method. The first level classifies AUs, taking as input the displacement vectors between the
neutral expression frame and the peak expression frame. The second level takes the detected
AUs as input to classify facial expressions.

Most facial expression recognition methods are AU-based [15, 22, 16, 20]. They are
often influenced by the FACS system proposed by Friesen et al. [14]. Nevertheless, there are also other techniques
that rely only on fiducial points to recognize facial expression, which minimizes computation time.

Abdat et al. [17] have focused on another geometric method to detect facial expression. They have used
twenty-one distances to encode facial expressions; these distances describe the deformations of facial features compared
to the neutral state. Their method relies firstly on the Shi and Tomasi algorithm to extract feature points,
and secondly on the Lucas-Kanade algorithm [24] to track them. The distance vector, calculated from image sequences,
is then used as a descriptor of the facial expression and is the input of an SVM
classifier.

Hammal et al. [13] have developed a classification system based on belief theory and applied it to the
Hammal-Caplier database. They used five distances between different parts of the face (eyebrows, both eyes and
mouth). In their work, distances were computed on skeletons of expression from image sequences; however, only
four classes (joy, surprise, disgust and neutral) were considered instead of the six basic emotions.

Perveen et al. [10] have focused their work on three regions (eyebrows, eyes, mouth) to define an emotion 
from static images. First, they calculated the characteristic points of the face, then they tried to evaluate some 
animation parameters such as: the openness of eyes, the width of eyes, the height of eyebrows, the opening of 
mouth, and the width of mouth. As a classification technique, they used a decision tree based method, applied only 
on thirty images from the JAFFE database [25], and they recognized six emotions (happy, surprise, fear, sad, angry, 
and neutral) excluding the disgust emotion. 

Saeed et al. [26] have proposed an emotion recognition system based on just eight fiducial points. They 
represented six geometric features by measuring some distances between mouth, eyes, and eyebrows. These features 
represent the changes of the face during an emotion occurrence. Then, the features were presented to an SVM 
classifier for emotion recognition. The system was applied on Cohn-Kanade database (CK+) [27], and Binghamton 
University 3D Facial Expression Database [28] to recognize six basic emotions. 

Majumder et al. [29] have suggested an emotion recognition model based on the Kohonen self-organizing
map (KSOM) that uses a 26-dimensional facial geometric feature vector calculated from three parts of the face (lips,
eyes and eyebrows) to describe the changes associated with the six basic emotions. The experiment was conducted on the MMI
database [30].

The research studies cited above show that dynamic facial expressions from image sequences are more
descriptive for the task of emotion recognition than static images, and can increase accuracy in real-time
applications.


3. PROPOSED METHOD 

This section presents and justifies our proposed technique for emotion recognition from facial expression.
Our contribution concerns the feature extraction step, in which we propose to calculate all Euclidean distances
between fiducial points in the first and in the last frames to measure facial motion. Firstly, we detect the face
using the Viola-Jones algorithm [31]. Then, we detect and track 49 fiducial points using the recent Supervised
Descent Method (SDM) proposed by Xiong et al. [32]. From these points, which represent four parts of the
face (eyebrows, eyes, nose, and mouth), we calculate the distance between each pair of points; as a result,
we get $C_{49}^{2} = 1176$ Euclidean distances. After that, to measure the dynamic deformation relative to the neutral state, we
calculate the ratio of each distance between the first and the last frames; these ratios are the dynamic features
(Section 3.1). Afterward, we use a feature selection method to reduce the number of features and to keep only
the most relevant ones. Finally, we present the selected dynamic features to a neural network classifier for facial
expression recognition.


3.1. Facial expression representation 


Once we have detected the face using the Viola-Jones algorithm [31], we apply the SDM method [32] to
detect and track fiducial points in image sequences.

To measure the face deformation, we consider only the first and the last frames. Firstly, we
calculate all Euclidean distances (1) from the 49 detected points, which are represented by their x and y coordinates (2) (3)
and cover the following parts of the face: 10 points for the eyebrows, 12 points for the eyes, 9 points for the nose, and 18
points for the mouth. In total, we calculate 1176 distances in the first frame and in the last frame. Then, we
measure the dynamic deformation by calculating the ratio (4) between frames: each distance calculated in the
peak frame is divided by the same distance calculated in the first frame. The dynamic features (5)
form a vector of 1176 ratios calculated relative to the neutral state. An overview of the facial
expression representation process is presented in Figure 2.


$D = [D_1, D_2, \ldots, D_i, \ldots, D_t]$ (1)

$V_0 = [x_{10}, y_{10}, x_{20}, y_{20}, \ldots, x_{n0}, y_{n0}]$ (2)

$V_p = [x_{1p}, y_{1p}, x_{2p}, y_{2p}, \ldots, x_{np}, y_{np}]$ (3)

$\Delta D_i = \frac{D_{ip}}{D_{i0}}$ (4)

$DF = [\Delta D_1, \Delta D_2, \ldots, \Delta D_t]$ (5)


Where
$n$: the total number of fiducial points.

$t$: the number of Euclidean distances, one for each pair of points.

$V_0$: x and y coordinates of the detected points in the first frame.

$V_p$: x and y coordinates of the detected points in the peak frame.

$D_{ip}$: the i-th Euclidean distance in the peak frame.

$D_{i0}$: the i-th Euclidean distance in the first frame.
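
The feature construction described by equations (1) to (5) can be illustrated with a minimal Python sketch. The function names `pairwise_distances` and `dynamic_features` are hypothetical, the landmark detector (SDM in this paper) is assumed to have already produced the point coordinates, and the sketch assumes that no two points coincide in the first frame, since that would make a ratio undefined.

```python
from itertools import combinations
from math import dist


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of (x, y) points, as in (1)."""
    return [dist(p, q) for p, q in combinations(points, 2)]


def dynamic_features(first_frame_points, peak_frame_points):
    """Ratio of each peak-frame distance to the same first-frame distance, as in (4).

    Assumption: both lists hold the same landmarks in the same order, and no
    first-frame distance is zero.
    """
    d0 = pairwise_distances(first_frame_points)
    dp = pairwise_distances(peak_frame_points)
    return [d_peak / d_first for d_first, d_peak in zip(d0, dp)]
```

With 49 points per frame, `dynamic_features` returns 1176 ratios. A ratio close to 1 indicates a distance that barely changes between the neutral and the peak frame.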


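[Figure: image sequence from the first frame through the second frame to the peak frame. The first frame gives $V_0 = [x_{10}, y_{10}, x_{20}, y_{20}, \ldots, x_{n0}, y_{n0}]$ and the distances $D_0 = [D_{10}, D_{20}, \ldots, D_{t0}]$; the peak frame gives $V_p = [x_{1p}, y_{1p}, x_{2p}, y_{2p}, \ldots, x_{np}, y_{np}]$ and the distances $D_p = [D_{1p}, D_{2p}, \ldots, D_{tp}]$; both sets of distances are combined into $DF = [\Delta D_1, \Delta D_2, \ldots, \Delta D_t]$.]


Figure 2. Dynamic features representation process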


3.2. Feature selection 


One of the key issues in emotion classification is the choice of features used for prediction. For this reason, a feature
selection step is used to choose the most relevant features. Generally, the feature selection step combines an
attribute subset evaluator with a search method. The attribute evaluator determines how a
worth is assigned to each subset of features. The search method determines what style of search is performed [33].

In this paper, we have chosen CfsSubsetEval as the feature evaluator and Best First as the search method, both
implemented in Weka. The CfsSubsetEval evaluator evaluates the worth of a subset of features by considering the
individual predictive ability of each feature along with the degree of redundancy between them; subsets of features
that are highly correlated with the class while having low inter-correlation are preferred. The Best First method
searches the space of feature subsets by greedy hill-climbing augmented with a backtracking facility [33].
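
To make the idea of correlation-based subset evaluation concrete, the following is a minimal Python sketch and not the Weka implementation. It scores a subset of k features with the usual CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. Two simplifications are assumptions of this sketch: absolute Pearson correlation against a numerically coded class stands in for the correlation measure (Weka may use a different one), and a plain greedy forward search stands in for Best First, so there is no backtracking. All function names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def cfs_merit(columns, target, subset):
    """CFS merit of the feature indices in `subset`; `columns[j]` is feature j."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[j], target)) for j in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward_selection(columns, target):
    """Add the feature that most improves the merit until no feature improves it."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        score, j = max((cfs_merit(columns, target, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected
```

The merit rewards features that correlate with the class and penalizes redundancy: a feature that duplicates one already selected raises k without raising the merit, so the search does not add it.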


3.3. Classification 


A neural network (NN) classifier has been chosen to classify facial expressions based on the dynamic features
that were previously selected. It was trained on a multi-class emotion recognition task using the backpropagation
algorithm, with the sigmoid function as the activation function. Our NN is a single network with one hidden layer.
The first layer represents the input data, which are the DF. The second one is the hidden layer, and the last one
represents the output classes. The number of neurons in the hidden layer was chosen experimentally.
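
The architecture described above (one hidden layer, sigmoid activations, backpropagation) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the paper does not specify the loss function, the weight initialization or the learning rate, so the squared-error loss, the uniform initialization in [-0.5, 0.5] and the per-sample updates used here are choices of this sketch, and the class name `OneHiddenLayerNet` is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OneHiddenLayerNet:
    """Input layer -> sigmoid hidden layer -> sigmoid output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one neuron; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, o

    def train_step(self, x, target, lr):
        """One backpropagation update for a single sample (squared-error loss)."""
        h, o = self.forward(x)
        # Output deltas: error times the sigmoid derivative.
        d_out = [(o_k - t_k) * o_k * (1 - o_k) for o_k, t_k in zip(o, target)]
        # Hidden deltas: propagate the output deltas back through w2.
        d_hid = [h_j * (1 - h_j) * sum(d_out[k] * self.w2[k][j] for k in range(len(o)))
                 for j, h_j in enumerate(h)]
        for k, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * d_hid[j] * v

    def predict(self, x):
        """Index of the output neuron with the highest activation."""
        _, o = self.forward(x)
        return max(range(len(o)), key=o.__getitem__)
```

In the setting of this paper, `n_in` would be the number of selected dynamic features and `n_out` the number of expression classes of the database, with one-hot target vectors. Inputs are passed as Python lists.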


4. EXPERIMENTAL SETUP 

The experiments of our work were conducted on three well-known facial expression databases: the Extended
Cohn-Kanade (CK+) database [34, 27], the Oulu-CASIA VIS database [35], and the JAFFE database [25].

The CK+ database [34, 27] contains 327 labeled image sequences that refer to one of seven expressions,
i.e., anger, contempt, disgust, fear, happiness, sadness, and surprise. For each image sequence, only the last frame
is provided with an expression label. This database is detailed as follows: 45 sequences of angry expression, 18 sequences
of contempt expression, 59 sequences of disgust expression, 25 sequences of fear expression, 69 sequences of happy
expression, 28 sequences of sad expression, and 83 sequences of surprise expression.

The Oulu-CASIA VIS database [35] was captured under different lighting conditions; we have used the sequences
recorded under strong, good lighting, which cover 80 subjects. Each subject performs the six basic
expressions (anger, disgust, fear, happiness, sadness, and surprise). In total, we have 480 labeled image
sequences.

The JAFFE database [25] contains 213 images from 10 Japanese female subjects. Each subject has 3
or 4 examples of each of the six basic expressions (anger, disgust, fear, happiness, sadness, and surprise) and of the neutral
expression. This database is detailed as follows: 30 images of angry expression, 29 images of disgust expression,
32 images of fear expression, 31 images of happy expression, 31 images of sad expression, 30 images of surprise
expression, and 30 images of neutral expression.


4.1. Training process 


In our work, we have carried out three experiments to observe the influence of each design choice on the
emotion recognition accuracy. All experiments were conducted on the three databases (CK+, Oulu-CASIA VIS, and
JAFFE). Each database was divided into 60% for training, 10% for validation, and 30% for testing, and a NN with
20 neurons in the hidden layer was used in all experiments.
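
The 60/10/30 partition can be sketched as follows. The paper does not say how samples were assigned to each part, so the random shuffle, the fixed seed and the rounding down of the training and validation sizes are assumptions of this sketch; `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n, train_pct=60, val_pct=10, seed=0):
    """Shuffle range(n) and cut it into train, validation and test index lists.

    Integer arithmetic is used for the sizes; the test set takes the remainder.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])
```

With this rounding, the training parts contain 196, 288 and 127 samples for CK+ (327), Oulu-CASIA VIS (480) and JAFFE (213) respectively, which agrees with the training sizes quoted in the discussion of the results.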

The first experiment has two parts. In the first part, the feature selection step is omitted and the DF (5) are used directly
as input to our classifier. The NN classifier therefore takes 1176 features as input and has six or seven output
classes, depending on the number of classes present in the database used. In the second part, the feature selection
step is applied. After calculating the DF (5) for each image sequence of each database, we applied the
feature selection method to reduce the number of features. First, we combined the
CK+, the Oulu-CASIA VIS and the JAFFE databases into one database that contains 1020 samples and covers eight
expressions (anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral). Then, we applied the feature
selection method to this new database in order to select a common set of relevant features. As a result, we
reduced our features from 1176 to 83. Last, we trained three classifiers, one on each of the three databases.
Each NN classifier takes 83 features as input and has six or seven output classes.

In the second experiment, we observe the ability to classify new image sequences: the
classifier trained on the CK+ database was tested on the Oulu-CASIA VIS and the JAFFE databases, and vice versa.

The last experiment consists firstly of unifying the three databases, that is, deleting the emotions that
do not appear in all databases and keeping the common ones. As a consequence, only 309 and 183 image
sequences remain for the CK+ and the JAFFE databases respectively. Secondly, it consists of testing each classifier trained
on one database on the two other databases while varying the size of the training set, and showing how that influences
the emotion recognition accuracy.
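
The unification step can be checked with a short Python sketch that keeps only the expression labels shared by all three databases. The per-class counts are those given in Section 4 and in the discussion of Table 2 (18 contempt sequences in CK+); the helper names `common_labels` and `restrict` are hypothetical. Only the label set is used for Oulu-CASIA VIS, since its per-class counts are not listed.

```python
CK_PLUS = {"anger": 45, "contempt": 18, "disgust": 59, "fear": 25,
           "happiness": 69, "sadness": 28, "surprise": 83}
JAFFE = {"anger": 30, "disgust": 29, "fear": 32, "happiness": 31,
         "sadness": 31, "surprise": 30, "neutral": 30}
OULU_LABELS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}


def common_labels(*label_collections):
    """Labels present in every database (dict keys or sets are both accepted)."""
    return set.intersection(*(set(labels) for labels in label_collections))


def restrict(counts, keep):
    """Drop the classes that are not in `keep`."""
    return {label: n for label, n in counts.items() if label in keep}


shared = common_labels(CK_PLUS, OULU_LABELS, JAFFE)
ck_unified = restrict(CK_PLUS, shared)     # contempt removed
jaffe_unified = restrict(JAFFE, shared)    # neutral removed
```

Summing the remaining counts gives 309 sequences for CK+ and 183 images for JAFFE, the sizes stated above.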


5. RESULTS & DISCUSSION 

Table 1 summarizes a comparison between the results achieved in our first experiment and those achieved
by Pu et al. [16] using random forest. The third column presents the emotion recognition accuracy achieved using
the calculated DF directly. The last column presents the emotion recognition accuracy achieved with the feature selection step,
where only 83 of the 1176 features are used.

The obtained results show that our method outperforms the AU-based method proposed in [16] whether the
feature selection process is used or not. Moreover, the feature selection process uses
fewer features and gives better results than using the DF directly.


Table 1. Comparison of emotion recognition accuracy (%) with and without the feature selection (FS) process


Database          Pu et al. [16]   Our approach (without FS)   Our approach (with FS)
CK+               96.3             98                          99
Oulu-CASIA VIS    76.25            81.3                        84.7
JAFFE             -                90.6                        93.8


Table 2 presents the results achieved in the second experiment by the three classifiers trained separately on the CK+,
the Oulu-CASIA VIS, and the JAFFE databases. The first classifier, trained on the CK+ database
with the same 83 features, gives an emotion recognition accuracy of 67.29% and 43.66% on the Oulu-CASIA VIS
and the JAFFE databases respectively. The second classifier, trained on the Oulu-CASIA VIS database,
gives an emotion recognition accuracy of 90.52% and 46.48% on the CK+ and the JAFFE databases respectively.
The third classifier, trained on the JAFFE database, gives an emotion recognition accuracy of 64.22% and
47.29% on the CK+ and the Oulu-CASIA VIS databases respectively. We obtained a competitive emotion
recognition accuracy with the second classifier, trained on the Oulu-CASIA VIS database, unlike the classifiers
trained on the CK+ and the JAFFE databases. This decrease can be explained firstly
by the smaller training sets of the CK+ database (196 image sequences) and the JAFFE database (127
image sequences) compared with that of the Oulu-CASIA VIS database (288 image sequences), and secondly by
the expressions that do not exist in all databases: the contempt expression is present in the CK+ database
with 18 occurrences but not in the other two databases, and likewise the JAFFE database contains 30 neutral expressions that are
treated as an emotion class. Training a classifier on supplementary expressions that do not exist in
all databases therefore decreases the emotion recognition accuracy. For this reason, we proceeded with the third
and last experiment, which unifies all the databases used and shows how that influences our results.


Table 2. Accuracy (%) of each trained classifier when tested on the other databases


Training \ Testing   CK+      Oulu-CASIA VIS   JAFFE
CK+                  -        67.29            43.66
Oulu-CASIA VIS       90.52    -                46.48
JAFFE                64.22    47.29            -


[Figure: three line plots of emotion recognition accuracy against training set size (30% to 90%), with one curve per test database.]


(a) Training on the CK+   (b) Training on the Oulu-CASIA VIS   (c) Training on the JAFFE


Figure 3. Training our proposed method with one database and testing it with the other databases


Figure 3 shows how increasing the training size on the unified databases raises the emotion recognition
accuracy over all databases. Figure 3(a) shows that the emotion recognition accuracy increases from 67.29% to 72.08%
and from 43.66% to 56.83% on the Oulu-CASIA VIS and the JAFFE databases respectively. Figure 3(b) shows
that the emotion recognition accuracy increases from 90.52% to 96.44% and from 46.48% to 53.55% on the CK+ and
the JAFFE databases respectively. Figure 3(c) shows that the emotion recognition accuracy increases from 64.22% to
72.49% and from 47.29% to 51.25% on the CK+ and the Oulu-CASIA VIS databases respectively.


6. CONCLUSION & FUTURE WORK 

In this research work, we have proposed an automatic approach for the facial expression recognition task. Our
approach uses dynamic features that are calculated from the first and the last frames, which represent
respectively the neutral state and an emotional state. After detecting the face and the fiducial points in the first and
the last frames, all possible Euclidean distances are calculated between each pair of points, which gives
1176 distances. Then, to measure the deformation, each distance calculated in the peak frame is divided
by the same distance calculated in the first frame, as defined in Section 3.1. After that, we use a feature selection process to reduce the
number of features by keeping only the most relevant ones. In the last step of our proposed approach,
we present the selected dynamic features to a neural network classifier for facial expression recognition.

Evaluating this approach on three known databases has given encouraging results using neural network 
classifier, with an emotion recognition accuracy of 99% on the CK+ database, 84.7% on the Oulu-CASIA VIS 
database, and 93.8% on the JAFFE database. 

In our future work we will continue developing our proposed system along several axes. Firstly, we will 
investigate the possibility of adding other features that represent the pose of the face. Secondly, we also intend to 
consider another source to recognize emotions, which is the intonation of voice, using acoustic parameters. Finally, 
our ultimate aim is to combine the two sources, which are facial expression and voice intonation, to automatically 
recognize emotions from multimodal data using new approaches of deep learning classification. 